Data Description:

The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw (unscaled) form, with 8 quantitative input variables, 1 quantitative output variable, and 1030 instances (observations).

Domain:

Cement manufacturing

Context:

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

Attribute Information:

  • Cement: measured in kg in a m3 mixture
  • Blast furnace slag: measured in kg in a m3 mixture
  • Fly ash: measured in kg in a m3 mixture
  • Water: measured in kg in a m3 mixture
  • Superplasticizer: measured in kg in a m3 mixture
  • Coarse aggregate: measured in kg in a m3 mixture
  • Fine aggregate: measured in kg in a m3 mixture
  • Age: in days (1–365)
  • Concrete compressive strength: measured in MPa

Objective:

Modeling the strength of high-performance concrete using Machine Learning

Importing Relevant Libraries

In [376]:
#for data manipulation
import pandas as pd
import numpy as np
#for plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(color_codes=True)
#data transformation and feature generation
from scipy.stats import zscore, norm, pearsonr
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
#model building, selection and evaluation
from sklearn.model_selection import (train_test_split, KFold, cross_val_score,
                                     RepeatedStratifiedKFold, GridSearchCV)
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              AdaBoostRegressor, BaggingRegressor)
from sklearn.cluster import KMeans
from sklearn.utils import resample
import warnings
warnings.filterwarnings('ignore')

Reading the dataset

In [2]:
#reading the data

data=pd.read_csv("concrete.csv")
In [3]:
data.head(10)
Out[3]:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28 29.89
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14 23.51
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28 29.22
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28 18.29
5 255.0 0.0 0.0 192.0 0.0 889.8 945.0 90 21.86
6 166.8 250.2 0.0 203.5 0.0 975.6 692.6 7 15.75
7 251.4 0.0 118.3 188.5 6.4 1028.4 757.7 56 36.64
8 296.0 0.0 0.0 192.0 0.0 1085.0 765.0 28 21.65
9 155.0 184.0 143.0 194.0 9.0 880.0 699.0 28 28.99
  • From the above table we can see that there are 8 independent variables and one dependent variable
  • All the records are numeric

Univariate Analysis

In [4]:
data.dtypes #to find the data types of each attributes
Out[4]:
cement          float64
slag            float64
ash             float64
water           float64
superplastic    float64
coarseagg       float64
fineagg         float64
age               int64
strength        float64
dtype: object
In [5]:
data.shape #no of rows and columns in the dataframe
Out[5]:
(1030, 9)

There are 1030 rows and 9 columns in the given dataset

Checking the presence of missing values

In [6]:
data.isnull().sum()
Out[6]:
cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64

There are no missing values in the given dataset

Descriptive statistics of each column

In [7]:
data.describe().transpose()
Out[7]:
count mean std min 25% 50% 75% max
cement 1030.0 281.167864 104.506364 102.00 192.375 272.900 350.000 540.0
slag 1030.0 73.895825 86.279342 0.00 0.000 22.000 142.950 359.4
ash 1030.0 54.188350 63.997004 0.00 0.000 0.000 118.300 200.1
water 1030.0 181.567282 21.354219 121.80 164.900 185.000 192.000 247.0
superplastic 1030.0 6.204660 5.973841 0.00 0.000 6.400 10.200 32.2
coarseagg 1030.0 972.918932 77.753954 801.00 932.000 968.000 1029.400 1145.0
fineagg 1030.0 773.580485 80.175980 594.00 730.950 779.500 824.000 992.6
age 1030.0 45.662136 63.169912 1.00 7.000 28.000 56.000 365.0
strength 1030.0 35.817961 16.705742 2.33 23.710 34.445 46.135 82.6

Five Point Summary

In [8]:
summary=data.describe().transpose()
summary[['min','25%','50%','75%','max']]
Out[8]:
min 25% 50% 75% max
cement 102.00 192.375 272.900 350.000 540.0
slag 0.00 0.000 22.000 142.950 359.4
ash 0.00 0.000 0.000 118.300 200.1
water 121.80 164.900 185.000 192.000 247.0
superplastic 0.00 0.000 6.400 10.200 32.2
coarseagg 801.00 932.000 968.000 1029.400 1145.0
fineagg 594.00 730.950 779.500 824.000 992.6
age 1.00 7.000 28.000 56.000 365.0
strength 2.33 23.710 34.445 46.135 82.6
In [9]:
data.skew(numeric_only = True)
Out[9]:
cement          0.509481
slag            0.800717
ash             0.537354
water           0.074628
superplastic    0.907203
coarseagg      -0.040220
fineagg        -0.253010
age             3.269177
strength        0.416977
dtype: float64

Positive skewness indicates the distribution is skewed to the right (a long right tail); negative skewness indicates it is skewed to the left.
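As a quick illustration of how the sign of skewness relates to the shape of a distribution, the following sketch (synthetic data, not the concrete dataset) compares a right-skewed sample with a near-symmetric one:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
right = rng.exponential(scale=2.0, size=10_000)    # long right tail
sym = rng.normal(loc=0.0, scale=1.0, size=10_000)  # symmetric

print(skew(right) > 0)                    # positive skew -> right-skewed
print(np.mean(right) > np.median(right))  # mean pulled toward the tail
print(abs(skew(sym)) < 0.1)               # near-zero skew -> symmetric
```

For a right-skewed column such as age, the mean sits to the right of the median, which is exactly the pattern seen in the distplots below.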

In [10]:
# A quick check to find columns that contain outliers
fig=plt.figure(figsize=(15,7))
ax=sns.boxplot(data=data,orient='v')

From the above plot we can see that all attributes other than cement, ash and coarseagg have outliers

Distribution of Independent attributes

cement

In [11]:
plt.figure(figsize=(20,20))
#boxplot
plt.subplot(3,3,1)
sns.boxplot(data.cement,showfliers=True,color="c").set_title("Distribution Of Cement")
# distplot
ax=plt.subplot(3,3,2)
sns.distplot(data.cement,color='m').set_title("cement Vs Frequency")
ax.axvline(data.cement.mean(),color='r',linestyle='--',label='Mean',linewidth=1.2)
ax.axvline(data.cement.median(),color='g',linestyle='--',label='Median',linewidth=1.2)
ax.legend(loc='best')
#histogram plot
plt.subplot(3,3,3)
data.cement.plot.hist(color='g').set_title("cement Vs Frequency");

Observation

  • The column is almost Normally Distributed
  • Mean and Median are almost the same
  • Most of the values are between 192 to 350
  • There are no outliers

slag

In [12]:
plt.figure(figsize=(20,20))
#boxplot
plt.subplot(3,3,1)
sns.boxplot(data.slag,showfliers=True,color='c').set_title("Distribution of Slag")

#dist plot
ax=plt.subplot(3,3,2)
sns.distplot(data.slag,color='m').set_title("slag Vs Frequency")
ax.axvline(data.slag.mean(),color='r',linestyle='--',label='Mean',linewidth=1.2)
ax.axvline(data.slag.median(),color='g',linestyle='--',label='Median',linewidth=1.2)
ax.legend(loc='best')

#histogram plot
plt.subplot(3,3,3)
data.slag.plot.hist(color='g').set_title("slag Vs Frequency");

Observation

  • The column is skewed towards right
  • Mean and Median are not the same
  • 25% of the values are 0
  • Most of the values are between 0 to 142
  • There are outliers
  • It has Three Gaussians

Ash

In [13]:
plt.figure(figsize=(20,20))
#boxplot
plt.subplot(3,3,1)
sns.boxplot(data.ash,showfliers=True,color='c').set_title("Distribution of Ash")

#dist plot
ax=plt.subplot(3,3,2)
sns.distplot(data.ash,color='m').set_title("ash Vs Frequency")
ax.axvline(data.ash.mean(),color='r',linestyle='--',label='Mean',linewidth=1.2)
ax.axvline(data.ash.median(),color='g',linestyle='--',label='Median',linewidth=1.2)
ax.legend(loc='best')

#histogram plot
plt.subplot(3,3,3)
data.ash.plot.hist(color='g').set_title("ash Vs Frequency");

Observation

  • The column is skewed towards right
  • 50% of the values are 0
  • Mean and Median are not the same
  • Most of the values are between 0 to 118
  • There are no outliers
  • It has Two Gaussians

water

In [14]:
plt.figure(figsize=(20,20))
#boxplot
plt.subplot(3,3,1)
sns.boxplot(data.water,showfliers=True,color='c').set_title("Distribution of water")

#dist plot
ax=plt.subplot(3,3,2)
sns.distplot(data.water,color='m').set_title("water Vs Frequency")
ax.axvline(data.water.mean(),color='r',linestyle='--',label='Mean',linewidth=1.2)
ax.axvline(data.water.median(),color='g',linestyle='--',label='Median',linewidth=1.2)
ax.legend(loc='best')

#histogram plot
plt.subplot(3,3,3)
data.water.plot.hist(color='g').set_title("water Vs Frequency");

Observation

  • The column is nearly symmetric, with only a very slight right skew (skewness ≈ 0.07)
  • Mean and Median are not the same
  • Most of the values are between 164 to 192
  • There are outliers
  • It has Three Gaussians

superplastic

In [15]:
plt.figure(figsize=(20,20))
#boxplot
plt.subplot(3,3,1)
sns.boxplot(data.superplastic,showfliers=True,color='c').set_title("Distribution of superplastic")

#dist plot
ax=plt.subplot(3,3,2)
sns.distplot(data.superplastic,color='m').set_title("superplastic Vs Frequency")
ax.axvline(data.superplastic.mean(),color='r',linestyle='--',label='Mean',linewidth=1.2)
ax.axvline(data.superplastic.median(),color='g',linestyle='--',label='Median',linewidth=1.2)
ax.legend(loc='best')

#histogram plot
plt.subplot(3,3,3)
data.superplastic.plot.hist(color='g').set_title("superplastic Vs Frequency");

Observation

  • The column is skewed towards right
  • Mean and Median are almost the same
  • 25% of the values are 0
  • Most of the values are between 0 to 10
  • There are outliers
  • It has Two Gaussians

coarseagg

In [16]:
plt.figure(figsize=(20,20))
#boxplot
plt.subplot(3,3,1)
sns.boxplot(data.coarseagg,showfliers=True,color='c').set_title("Distribution of coarseagg")

#dist plot
ax=plt.subplot(3,3,2)
sns.distplot(data.coarseagg,color='m').set_title("coarseagg Vs Frequency")
ax.axvline(data.coarseagg.mean(),color='r',linestyle='--',label='Mean',linewidth=1.2)
ax.axvline(data.coarseagg.median(),color='g',linestyle='--',label='Median',linewidth=1.2)
ax.legend(loc='best')

#histogram plot
plt.subplot(3,3,3)
data.coarseagg.plot.hist(color='g').set_title("coarseagg Vs Frequency");

Observation

  • The column is almost Normally Distributed
  • Mean and Median are almost the same
  • Most of the values are between 932 to 1029
  • There are no outliers
  • It has Three Gaussians

fineagg

In [17]:
plt.figure(figsize=(20,20))
#boxplot
plt.subplot(3,3,1)
sns.boxplot(data.fineagg,showfliers=True,color='c').set_title("Distribution of fineagg")

#dist plot
ax=plt.subplot(3,3,2)
sns.distplot(data.fineagg,color='m').set_title("fineagg Vs Frequency")
ax.axvline(data.fineagg.mean(),color='r',linestyle='--',label='Mean',linewidth=1.2)
ax.axvline(data.fineagg.median(),color='g',linestyle='--',label='Median',linewidth=1.2)
ax.legend(loc='best')

#histogram plot
plt.subplot(3,3,3)
data.fineagg.plot.hist(color='g').set_title("fineagg Vs Frequency");

Observation

  • The column is almost Normally Distributed
  • Mean and Median are almost the same
  • Most of the values are between 731 to 824
  • There are outliers
  • It has Two Gaussians

age

In [18]:
plt.figure(figsize=(20,20))
#boxplot
plt.subplot(3,3,1)
sns.boxplot(data.age,showfliers=True,color='c').set_title("Distribution of age")

#dist plot
ax=plt.subplot(3,3,2)
sns.distplot(data.age,color='m').set_title("age Vs Frequency")
ax.axvline(data.age.mean(),color='r',linestyle='--',label='Mean',linewidth=1.2)
ax.axvline(data.age.median(),color='g',linestyle='--',label='Median',linewidth=1.2)
ax.legend(loc='best')

#histogram plot
plt.subplot(3,3,3)
data.age.plot.hist(color='g').set_title("age Vs Frequency");

Observation

  • The column is skewed towards right
  • Mean and Median are not the same
  • Most of the values are between 7 to 56
  • There are lot of outliers
  • It has Multiple Gaussians
In [19]:
agedata=data.copy()
def age_bin(data):
    if data.age <= 30:
        return '1 month'
    if data.age > 30 and data.age <= 60 :
        return '2 months'
    if data.age > 60 and data.age <= 90 :
        return '3 months'
    if data.age > 90 and data.age <= 120 :
        return '4 months'
    if data.age > 120 and data.age <= 150 :
        return '5 months'
    if data.age > 150 and data.age <= 180 :
        return '6 months'
    if data.age > 180 and data.age <= 210 :
        return '7 months'
    if data.age > 210 and data.age <= 240 :
        return '8 months'
    if data.age > 240 and data.age <= 270 :
        return '9 months'
    if data.age > 270 and data.age <= 300 :
        return '10 months'
    if data.age > 300 and data.age <= 330 :
        return '11 months'
    if data.age > 330 :
        return '12 months'
agedata['age_in_months'] = agedata.apply(age_bin, axis=1)
In [20]:
ax=plt.figure(figsize=(10, 6))
sns.countplot(agedata['age_in_months'], order = ['1 month', '2 months', '3 months', '4 months', '6 months', '9 months', '12 months'])
print(agedata['age_in_months'].value_counts())
1 month      749
2 months      91
4 months      77
3 months      54
6 months      26
12 months     20
9 months      13
Name: age_in_months, dtype: int64
  • From the above plot we can see that the one-month bin has the maximum number of observations
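The row-wise apply used to build age_in_months works but is slow on larger frames; the same 30-day bins can be produced in one vectorized call with pd.cut. A sketch on a small hypothetical age series:

```python
import pandas as pd

ages = pd.Series([1, 28, 56, 90, 120, 180, 270, 365], name="age")

# 30-day-wide, right-closed bins matching the age_bin() function above
edges = list(range(0, 331, 30)) + [365]  # 0, 30, 60, ..., 330, 365
labels = [f"{m} month" if m == 1 else f"{m} months" for m in range(1, 13)]
age_in_months = pd.cut(ages, bins=edges, labels=labels)

print(age_in_months.tolist())
```

pd.cut uses right-closed intervals by default, so (0, 30] maps to "1 month" just as `age <= 30` does in the function above.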

Target Column (strength) Distribution

strength

In [21]:
plt.figure(figsize=(20,20))
#boxplot
plt.subplot(3,3,1)
sns.boxplot(data.strength,showfliers=True,color='c').set_title("Distribution of strength")

#dist plot
ax=plt.subplot(3,3,2)
sns.distplot(data.strength,color='m').set_title("strength Vs Frequency")
ax.axvline(data.strength.mean(),color='r',linestyle='--',label='Mean',linewidth=1.2)
ax.axvline(data.strength.median(),color='g',linestyle='--',label='Median',linewidth=1.2)
ax.legend(loc='best')

#histogram plot
plt.subplot(3,3,3)
data.strength.plot.hist(color='g').set_title("strength Vs Frequency");

Observation

  • The column is slightly skewed towards right
  • Mean and Median are close but not equal
  • Most of the values are between 23.7 and 46.1 MPa (the interquartile range)
  • There are a few outliers at the high end

Multivariate Analysis

Influence of Different attributes on Strength

In [22]:
for col in list(data.columns)[:-1]:
    fig,ax1=plt.subplots(figsize=(15,7.5),ncols=1,sharex=False)
    sns.regplot(x=data[col],y=data['strength'],ax=ax1).set_title(f'{col} Vs strength')
In [23]:
sns.pairplot(data,diag_kind='kde');
In [24]:
plt.figure(figsize=(25,25))
ax=sns.heatmap(data.corr(),vmax=.8,square=True,fmt='.2f',annot=True,linecolor='white',linewidths=0.01)
plt.title('Correlation of Attributes')
plt.show()

Based on the Pair Plot and Correlation Matrix we understand that

  • Cement does not have any significant relationship with the other independent attributes, but it is positively associated with the target attribute strength; the relation is not very strong

  • Slag does not have any significant relation with any of the other attributes

  • Ash does not have any significant relation with any of the other attributes

  • water has a negative association with superplastic and fineagg, and no other significant relation with the remaining attributes

  • superplastic has a negative association with water and a positive association with ash and strength, but these relations are not very strong

  • coarseagg does not have any significant relation with any of the other attributes

  • age has a slight positive association with strength

Outlier Treatment

In [25]:
#creating the copy of orginal data set
data1=data.copy()
In [26]:
data1.boxplot(figsize=(40,20));
In [27]:
_, bp = data1.boxplot(return_type='both', figsize=(20,10), rot='vertical')

fliers = [flier.get_ydata() for flier in bp["fliers"]]
boxes = [box.get_ydata() for box in bp["boxes"]]
caps = [cap.get_ydata() for cap in bp["caps"]]
whiskers = [whisker.get_ydata() for whisker in bp["whiskers"]]
In [28]:
for idx, col in enumerate(data1.columns):
    print('Number of outliers in ',col, '-', len(fliers[idx]))
Number of outliers in  cement - 0
Number of outliers in  slag - 2
Number of outliers in  ash - 0
Number of outliers in  water - 9
Number of outliers in  superplastic - 10
Number of outliers in  coarseagg - 0
Number of outliers in  fineagg - 5
Number of outliers in  age - 59
Number of outliers in  strength - 4
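The flier counts above are read off the matplotlib boxplot artists; the same counts can be computed directly from the 1.5×IQR rule without any plotting. A sketch on a synthetic column with five planted extremes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ages = np.r_[rng.integers(1, 90, 95), [365] * 5]  # 95 ordinary values + 5 extremes
df = pd.DataFrame({"age": ages})

q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
print(int(mask.sum()))  # number of IQR outliers in the column
```

This avoids depending on the boxplot's internal artist ordering when all that is needed is the counts.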
In [29]:
for idx, col in enumerate(data1.columns):
    q1 = data1[col].quantile(0.25)
    q3 = data1[col].quantile(0.75)
    low = q1 - 1.5*(q3 - q1)
    high = q3 + 1.5*(q3 - q1)

    data1.loc[(data1[col] < low), col] = caps[idx * 2][0]
    data1.loc[(data1[col] > high), col] = caps[idx * 2 + 1][0]
In [30]:
# Check the dataset after Outlier treatment
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(data = data1.iloc[:, 0:18], orient = 'h')
  • All the outliers are treated
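The capping loop above pulls extreme values back to the boxplot cap positions; an equivalent self-contained variant (a sketch, clipping to the Tukey fences rather than the notebook's exact cap values) uses Series.clip:

```python
import pandas as pd

df = pd.DataFrame({"water": [121.8, 160.0, 185.0, 192.0, 247.0, 500.0]})

q1, q3 = df["water"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# values beyond the Tukey fences are pulled back to the fence values
df["water"] = df["water"].clip(lower=low, upper=high)
print(df["water"].max())
```

The extreme value 500.0 is replaced by the upper fence, while in-range values are left untouched.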

Feature Engineering

Principal Component Analysis (PCA)

In [31]:
#Standardise the data
from scipy.stats import zscore
data_scaled=data1.apply(zscore)
In [32]:
data_scaled
Out[32]:
cement slag ash water superplastic coarseagg fineagg age strength
0 -1.339017 1.603837 -0.847144 1.038806 -1.070393 -0.014398 -0.312289 -0.276792 -0.354999
1 -1.074790 -0.367612 1.096078 -1.099025 0.812800 1.388141 0.287169 -0.683574 -0.737503
2 -0.298384 -0.857572 0.648965 0.277322 -0.111360 -0.206121 1.104041 -0.276792 -0.395168
3 -0.145209 0.466016 -0.847144 2.197586 -1.070393 -0.526517 -1.298819 -0.276792 0.601862
4 -1.209776 1.271779 -0.847144 0.556375 0.516371 0.958372 -0.963273 -0.276792 -1.050462
... ... ... ... ... ... ... ... ... ...
1025 -1.399330 -0.857572 1.747988 -0.072677 0.673304 -0.153365 0.397761 -0.276792 -1.350230
1026 2.394626 -0.857572 -0.847144 -1.879428 3.009858 -1.554617 1.512477 -1.003188 0.329072
1027 -0.045645 0.489237 0.564545 -0.091596 0.481497 -1.323005 -0.063457 -0.276792 0.507734
1028 0.582373 -0.416376 -0.847144 2.197586 -1.070393 -0.526517 -1.298819 2.396346 1.154035
1029 2.477915 -0.857572 -0.847144 -0.403757 -1.070393 1.956877 -2.015153 -0.886965 1.007148

1030 rows × 9 columns

In [33]:
#Creating a Covariance Matrix

cov_matrix=np.cov(data_scaled.T)
print('Covariance Matrix\n%s' % cov_matrix)
Covariance Matrix 
[[ 1.00097182 -0.27567318 -0.39785361 -0.08179919  0.07313521 -0.10945526
  -0.22712896  0.04880104  0.49851342]
 [-0.27567318  1.00097182 -0.32396947  0.10681438  0.04530973 -0.28447347
  -0.28437099 -0.05201993  0.134922  ]
 [-0.39785361 -0.32396947  1.00097182 -0.25918954  0.40318139 -0.00997051
   0.08221025 -0.08302168 -0.10562992]
 [-0.08179919  0.10681438 -0.25918954  1.00097182 -0.6686933  -0.17920813
  -0.44805761  0.17630894 -0.29238688]
 [ 0.07313521  0.04530973  0.40318139 -0.6686933   1.00097182 -0.25930063
   0.21257046 -0.11224322  0.36634444]
 [-0.10945526 -0.28447347 -0.00997051 -0.17920813 -0.25930063  1.00097182
  -0.17540462  0.01295939 -0.1654945 ]
 [-0.22712896 -0.28437099  0.08221025 -0.44805761  0.21257046 -0.17540462
   1.00097182 -0.08726267 -0.17119166]
 [ 0.04880104 -0.05201993 -0.08302168  0.17630894 -0.11224322  0.01295939
  -0.08726267  1.00097182  0.48084423]
 [ 0.49851342  0.134922   -0.10562992 -0.29238688  0.36634444 -0.1654945
  -0.17119166  0.48084423  1.00097182]]
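As a sanity check, np.cov on the transposed array (default ddof=1) agrees with pandas' DataFrame.cov; a quick equivalence sketch on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

cov_np = np.cov(df.to_numpy().T)  # rows become variables after transpose
cov_pd = df.cov().to_numpy()      # pandas also uses ddof=1

print(np.allclose(cov_np, cov_pd))
```

Both use the unbiased (N−1) estimator, which is why the diagonal above is 1030/1029 ≈ 1.00097 rather than exactly 1 for z-scored data.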
In [34]:
#Calculate Eigen Values & Eigen Vectors
e_vals,e_vecs=np.linalg.eig(cov_matrix)
print('Eigen Values \n%s' %e_vals)
print('\nEigen Vectors \n%s'%e_vecs)
Eigen Values 
[2.23526821 1.97557419 0.02866252 0.11646618 0.22478059 1.41255364
 0.89180319 1.11377886 1.00985897]

Eigen Vectors 
[[-0.01001461  0.50873888  0.48150913 -0.30485598  0.04165939 -0.32609802
  -0.31424447  0.45766615  0.00359583]
 [-0.16974728  0.14173165  0.45305068 -0.20392733 -0.08653888  0.69428825
   0.33769052 -0.04178465  0.31456436]
 [ 0.37222303 -0.27982934  0.38031908 -0.17492266 -0.3298988  -0.02234756
  -0.4796712  -0.513255    0.07111424]
 [-0.57648036 -0.06416759  0.35360609  0.54268862  0.18108156  0.097407
  -0.3480443  -0.13591935 -0.24923311]
 [ 0.56498725  0.17568681  0.06235528  0.29730469  0.68130224  0.21736901
  -0.1273612  -0.04639141  0.16688387]
 [-0.07347676 -0.19387555  0.33259397  0.2271219   0.09751799 -0.5641748
   0.40555695 -0.12134447  0.53647595]
 [ 0.36816662 -0.2230147   0.41234402  0.1843876  -0.12192342 -0.00631552
   0.41303628  0.23991706 -0.60632254]
 [-0.11499654  0.3409747   0.04775779 -0.33957781  0.31504764 -0.18789224
   0.27856345 -0.62624198 -0.38546469]
 [ 0.16378547  0.63578461 -0.07999271  0.50108013 -0.51122298 -0.02739533
   0.08427545 -0.20068659  0.04466802]]
In [35]:
# the "cumulative variance explained" analysis 
tot=sum(e_vals)
var_exp=[(i/tot)*100 for i in sorted(e_vals,reverse=True)]
cum_var_exp=np.cumsum(var_exp)
print("Cumulative Variance Explained",cum_var_exp)
Cumulative Variance Explained [ 24.81220049  46.74171331  62.4215159   74.78482173  85.99458307
  95.89388707  98.38902436  99.6818367  100.        ]
In [36]:
# Plotting the variance explained by the principal components and the cumulative variance explained.
plt.figure(figsize=(15,10))
plt.axhline(y=95, color='r', linestyle=':')
plt.bar(range(1, e_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, e_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

Observations:

  • The above plot shows that six principal components are needed to explain more than 95% of the variance in the dataset (the cumulative curve crosses 95% at the sixth component).
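The hand-rolled eigen-analysis can be cross-checked with sklearn's PCA, whose explained_variance_ratio_ yields the same cumulative curve. A minimal sketch on synthetic standardized data (on the real data the curve crosses 95% at the sixth component):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(1030, 9))  # stand-in for the z-scored data
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_) * 100
n_components_95 = int(np.argmax(cum >= 95) + 1)
print(n_components_95)  # components needed to reach 95% variance
```

This removes any chance of a sign or ordering slip when sorting eigenvalues by hand.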
In [37]:
# Create a new matrix using the n components
pca=PCA(n_components=6).fit_transform(data_scaled)
In [38]:
#Converting PCA Transformed data from Array to Dataframe to visualise in the pairplot
pca_df=pd.DataFrame(pca)
sns.pairplot(pca_df,diag_kind='kde')
Out[38]:
<seaborn.axisgrid.PairGrid at 0x7f91aa7f3850>

Observation

  • From the above graphs we can see that the data points form unstructured clouds
  • There is no clear relation between any pair of components
  • We would have to drop 3 features from the dataset to retain 95% of the variance
  • We will drop weak attributes directly instead of using PCA
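If PCA were retained, the components_ matrix would show how each original attribute loads on each retained component, which helps interpret what would be lost by dropping features instead. A sketch with synthetic data and the dataset's column names:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

cols = ["cement", "slag", "ash", "water", "superplastic",
        "coarseagg", "fineagg", "age", "strength"]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))  # stand-in for the z-scored data

pca = PCA(n_components=6).fit(X)
loadings = pd.DataFrame(pca.components_, columns=cols)
print(loadings.shape)  # one row per retained component
```

Each row of the loadings frame is one principal component; large absolute loadings mark the original attributes that dominate that direction.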

Decide on the complexity of the model: is a simple linear model sufficient, or would a quadratic or higher-degree model help?

In [39]:
X=data_scaled.drop(['strength'],axis=1)
y=data_scaled.strength
X.head()
Out[39]:
cement slag ash water superplastic coarseagg fineagg age
0 -1.339017 1.603837 -0.847144 1.038806 -1.070393 -0.014398 -0.312289 -0.276792
1 -1.074790 -0.367612 1.096078 -1.099025 0.812800 1.388141 0.287169 -0.683574
2 -0.298384 -0.857572 0.648965 0.277322 -0.111360 -0.206121 1.104041 -0.276792
3 -0.145209 0.466016 -0.847144 2.197586 -1.070393 -0.526517 -1.298819 -0.276792
4 -1.209776 1.271779 -0.847144 0.556375 0.516371 0.958372 -0.963273 -0.276792

Splitting of values

In [40]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=1)
In [41]:
print("Shape of Training Data :",X_train.shape)
print("Shape of Testing Data :",X_test.shape)
Shape of Training Data : (721, 8)
Shape of Testing Data : (309, 8)

Linear Regression

In [42]:
lr_model=LinearRegression()
lr_model.fit(X_train,y_train)
lr_trainscore=lr_model.score(X_train,y_train)
lr_testscore=lr_model.score(X_test,y_test)
In [43]:
#coefficients 
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, lr_model.coef_[idx]))
The coefficient for cement is 0.7584917548601813
The coefficient for slag is 0.5241006389242318
The coefficient for ash is 0.3098468697528367
The coefficient for water is -0.1817225803453452
The coefficient for superplastic is 0.11167490736424873
The coefficient for coarseagg is 0.06882315251500634
The coefficient for fineagg is 0.10559893643279736
The coefficient for age is 0.5505055343511238
  • From the above we can see that superplastic and coarseagg have very weak coefficients

Regularizing the model

- Ridge Model

In [44]:
ridge = Ridge(alpha=.3)
ridge.fit(X_train,y_train)
ridge_trainscore=ridge.score(X_train,y_train)
ridge_testscore=ridge.score(X_test,y_test)
print ("Ridge model:", (ridge.coef_))
Ridge model: [ 0.75396061  0.51968804  0.30589086 -0.1846488   0.11175905  0.06585629
  0.10177101  0.55012074]

Lasso Model

In [45]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train,y_train)
lasso_trainscore=lasso.score(X_train,y_train)
lasso_testscore=lasso.score(X_test,y_test)
print ("Lasso model:", (lasso.coef_))
Lasso model: [ 0.40537318  0.16156057  0.         -0.12698781  0.19074579 -0.
 -0.          0.40136114]
  • The coefficients of ash, coarseagg and fineagg are 0 in the Lasso model, so we can drop these features
In [46]:
score={'Train Score' : {'Regression' : lr_trainscore,
                        'Ridge':ridge_trainscore,
                        'Lasso':lasso_trainscore},
      'Test Score':{'Regression' : lr_testscore,
                        'Ridge':ridge_testscore,
                        'Lasso':lasso_testscore}}
score_df=pd.DataFrame(score)
score_df
Out[46]:
Train Score Test Score
Regression 0.734879 0.738512
Ridge 0.734876 0.738682
Lasso 0.656046 0.650420
In [325]:
results = pd.DataFrame({'Method':['Regression'], 'Train Accuracy': lr_trainscore,'Test Accuracy':lr_testscore},index={'1'})
tempresultsdf=pd.DataFrame({'Method':['Ridge'], 'Train Accuracy': ridge_trainscore,'Test Accuracy':ridge_testscore},index={'2'})
results=pd.concat([results,tempresultsdf])
tempresultsdf=pd.DataFrame({'Method':['Lasso'], 'Train Accuracy': lasso_trainscore,'Test Accuracy':lasso_testscore},index={'3'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[325]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
  • The accuracy scores of the Regression and Ridge models are almost the same on the train and test sets, and their performance is better than Lasso's, but Lasso achieved its score using three fewer features
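A single 70/30 split can be noisy; KFold and cross_val_score (already imported above) give a more stable comparison of the three models. A sketch on synthetic regression data standing in for the concrete frame:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=8, n_informative=8,
                       noise=10.0, random_state=1)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
for name, model in [("Regression", LinearRegression()),
                    ("Ridge", Ridge(alpha=0.3)),
                    ("Lasso", Lasso(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f}")
```

Reporting the mean and spread over five folds makes the "Regression ≈ Ridge > Lasso" ranking above much harder to attribute to one lucky split.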

Checking whether increasing the model complexity improves the model performance

In [48]:
poly=PolynomialFeatures(degree = 2, interaction_only=True)#quadratic with degree 2
X_poly=poly.fit_transform(X)
Xpoly_train,Xpoly_test,ypoly_train,ypoly_test=train_test_split(X_poly,y,test_size=.30,random_state=1)
In [49]:
Xpoly_train.shape
Out[49]:
(721, 37)
In [50]:
lr_model.fit(Xpoly_train,ypoly_train)
print(lr_model.coef_)
[ 5.02544965e-18  7.78268311e-01  5.82866578e-01  2.83711534e-01
 -1.54014067e-01  1.73032754e-01  6.45550769e-02  1.18249307e-01
  6.17939938e-01  5.05390466e-02  7.78386978e-02 -1.71701293e-01
 -1.29899818e-01  2.98435390e-02  4.66417030e-02  1.05544622e-01
  1.02707455e-01 -4.61125408e-02 -4.55809917e-03  3.44137169e-02
  1.27722102e-01  1.96149033e-01 -6.74487664e-02 -1.36880799e-01
  2.63165164e-02  1.21301652e-01  1.58395492e-01  1.00400019e-01
 -5.23705831e-02 -1.17862133e-03 -1.84088254e-02  6.17010077e-02
  2.98334161e-02 -6.76628921e-03  1.00058077e-01  1.53909332e-02
  5.93020528e-02]
In [51]:
#Train accuracy score
lr_model.score(Xpoly_train,ypoly_train)
Out[51]:
0.8044841143233414
In [52]:
#Test accuracy score
lr_model.score(Xpoly_test,ypoly_test)
Out[52]:
0.7747562757353341
In [53]:
ridge.fit(Xpoly_train,ypoly_train)#ridge model
print ("Ridge model:", (ridge.coef_))
Ridge model: [ 0.          0.77115733  0.57509217  0.27757773 -0.1590232   0.17290327
  0.05995112  0.11201165  0.61771285  0.04958149  0.07747198 -0.16948008
 -0.12754684  0.03010105  0.0464961   0.10280458  0.10121101 -0.04507289
 -0.00284898  0.03366509  0.12696361  0.19394823 -0.0653725  -0.1357679
  0.02622534  0.12111767  0.15584753  0.1015438  -0.05137528 -0.00077962
 -0.01966521  0.06251973  0.0310964  -0.00618412  0.10008236  0.01471412
  0.05729425]
In [54]:
lasso.fit(Xpoly_train,ypoly_train)#lasso model
print ("Lasso model:", (lasso.coef_))
Lasso model: [ 0.          0.39652348  0.15170902  0.         -0.12078103  0.17961075
 -0.         -0.          0.39947473  0.         -0.         -0.
 -0.          0.         -0.         -0.         -0.         -0.
  0.         -0.          0.          0.          0.         -0.04090343
 -0.          0.          0.          0.         -0.          0.
 -0.          0.          0.          0.          0.          0.
  0.        ]
In [55]:
print("Ridge Scores: ")
print("train score : ", ridge.score(Xpoly_train,ypoly_train))
print("test score : ", ridge.score(Xpoly_test,ypoly_test))
print()
print("Lasso Scores:")
print("train score : ", lasso.score(Xpoly_train,ypoly_train))
print("test score : ", lasso.score(Xpoly_test,ypoly_test))
Ridge Scores: 
train score :  0.804477994525461
test score :  0.7747084063148023

Lasso Scores:
train score :  0.657424361870979
test score :  0.6470046309081134

Observations

  • The model performance increased slightly
  • A quadratic model performs somewhat better than a simple linear model
  • The remaining error may be due to the data being a mix of Gaussians
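Rather than fixing the polynomial degree by hand, the degree can be chosen by cross-validated search: wrap PolynomialFeatures and Ridge in a pipeline and grid-search the degree (make_pipeline and GridSearchCV are already imported). A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=300, n_features=4, n_informative=4,
                       noise=5.0, random_state=1)

# make_pipeline names steps after their lowercased class names,
# hence the "polynomialfeatures__degree" parameter key
pipe = make_pipeline(PolynomialFeatures(include_bias=False), Ridge(alpha=0.3))
grid = {"polynomialfeatures__degree": [1, 2, 3]}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2").fit(X, y)
print(search.best_params_)
```

Putting the polynomial expansion inside the pipeline also keeps the transform from leaking test-fold information during cross-validation.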

Explore for Gaussians: if the data is likely to be a mix of Gaussians, explore the individual clusters and present your findings in terms of the independent attributes and their suitability to predict strength

K Means Clustering

Now, we will use K-Means clustering to group the data based on their attributes. First, we need to determine the optimal number of clusters; for that we use the elbow method and look for the point where the distortion curve bends.

In [58]:
### Finding Optimal no of clusters
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]
for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(data_scaled)
    prediction=model.predict(data_scaled)
    meanDistortions.append(sum(np.min(cdist(data_scaled, model.cluster_centers_, 'euclidean'), axis=1)) / data_scaled.shape[0])
                          
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
Out[58]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
In [59]:
#Lets try with 3 groups
kmeans_model=KMeans(3)
kmeans_model.fit(data_scaled)
prediction=kmeans_model.predict(data_scaled)
kmeans_df=data_scaled.copy()#creating copy
kmeans_df["Group"]=prediction
kmeans_df.head()
Out[59]:
cement slag ash water superplastic coarseagg fineagg age strength Group
0 -1.339017 1.603837 -0.847144 1.038806 -1.070393 -0.014398 -0.312289 -0.276792 -0.354999 2
1 -1.074790 -0.367612 1.096078 -1.099025 0.812800 1.388141 0.287169 -0.683574 -0.737503 0
2 -0.298384 -0.857572 0.648965 0.277322 -0.111360 -0.206121 1.104041 -0.276792 -0.395168 0
3 -0.145209 0.466016 -0.847144 2.197586 -1.070393 -0.526517 -1.298819 -0.276792 0.601862 2
4 -1.209776 1.271779 -0.847144 0.556375 0.516371 0.958372 -0.963273 -0.276792 -1.050462 2
In [60]:
clusters = kmeans_df.groupby(["Group"])
clusters.mean()
Out[60]:
cement slag ash water superplastic coarseagg fineagg age strength
Group
0 -0.625186 -0.417885 1.135717 -0.292646 0.443136 0.103508 0.237376 -0.070343 -0.242659
1 0.958448 0.488087 -0.405972 -0.866947 1.025055 -0.661010 0.107663 -0.127155 1.121341
2 0.062121 0.119575 -0.795950 0.702171 -0.916055 0.246719 -0.264858 0.127219 -0.359267
In [61]:
kmeans_df.boxplot(by='Group',layout=(3,3),figsize=(15,10));

Let us analyze the Strength column vs other columns group wise.

In [62]:
#cement vs strength
with sns.axes_style("white"):
    plot=sns.lmplot('cement','strength',data=kmeans_df,hue='Group')
    plot.set(ylim=(-3,3))
In [63]:
#slag vs strength
with sns.axes_style("white"):
    plot=sns.lmplot('slag','strength',data=kmeans_df,hue='Group')
    plot.set(ylim=(-3,3))
In [64]:
#ash vs strength
with sns.axes_style("white"):
    plot=sns.lmplot('ash','strength',data=kmeans_df,hue='Group')
    plot.set(ylim=(-3,3))
In [65]:
#superplastic vs strength
with sns.axes_style("white"):
    plot=sns.lmplot('superplastic','strength',data=kmeans_df,hue='Group')
    plot.set(ylim=(-3,3))
In [66]:
#coarseagg vs strength
with sns.axes_style("white"):
     plot=sns.lmplot('coarseagg','strength',data=kmeans_df,hue='Group')
     plot.set(ylim=(-3,3))
In [67]:
#fineagg vs strength
with sns.axes_style("white"):
     plot=sns.lmplot('fineagg','strength',data=kmeans_df,hue='Group')
     plot.set(ylim=(-3,3))
In [68]:
#water vs strength
with sns.axes_style("white"):
     plot=sns.lmplot('water','strength',data=kmeans_df,hue='Group')
     plot.set(ylim=(-3,3))
In [69]:
#age vs strength
with sns.axes_style("white"):
     plot=sns.lmplot('age','strength',data=kmeans_df,hue='Group')
     plot.set(ylim=(-3,3))

Observations from the boxplots and lm plots

  • The groups overlap with each other
  • It is difficult to distinguish the clusters clearly, so it is hard to pull out a single cluster and build a model on it
  • Hence K-means also does not seem to help our cause
  • ash, coarseagg and fineagg are weak contributors
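The overlap seen in the plots can also be quantified with a silhouette score (values near 0 indicate heavily overlapping clusters). A minimal sketch on synthetic stand-in data; in the notebook, `data_scaled` would be passed in place of `X`:

```python
# Quantify cluster overlap with the silhouette score (sketch).
# Synthetic stand-in data is used here so the snippet runs on its own;
# in practice, data_scaled would be used instead of X.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 9))  # 9 standardized columns, like data_scaled

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
score = silhouette_score(X, km.labels_)
print(round(score, 3))  # near 0 => clusters overlap heavily
```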

Decision Tree

In [73]:
from sklearn.tree import DecisionTreeRegressor

dt=DecisionTreeRegressor()
dt.fit(X_train,y_train)
Out[73]:
DecisionTreeRegressor()
Feature Importance plot using Decision Tree Regressor
In [130]:
pd.DataFrame(dt.feature_importances_, index = data_scaled.columns[:-1], 
             columns=['Importance']).sort_values('Importance',ascending=False).plot(kind='bar',color='c', figsize=(15,7), title='Feature Importance of Decision Tree')
Out[130]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f91ab875af0>
  • cement, age, water and slag are important attributes
  • coarseagg, fineagg, superplastic and ash are less important, so they will not contribute much to the strength
  • coarseagg, fineagg and ash were flagged as weak contributors by both the correlation map and the decision tree feature importance, hence we will drop these columns
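Impurity-based importances from a single tree can be unstable, so permutation importance on held-out data is a common cross-check before dropping columns. A sketch on synthetic stand-in data (the notebook's `X_test`/`y_test` would be used in practice):

```python
# Permutation importance as a cross-check on tree feature importances (sketch).
# Synthetic stand-in data: feature 0 is constructed to dominate the target.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(1)
X = rng.normal(size=(400, 4))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=400)

tree = DecisionTreeRegressor(random_state=1).fit(X[:300], y[:300])
result = permutation_importance(tree, X[300:], y[300:], n_repeats=10, random_state=1)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)  # feature 0 should rank first
```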
In [75]:
#creating copy of the dataset
data_dt=data_scaled.copy()
In [76]:
X=data_dt.drop(['strength','ash','fineagg','coarseagg'],axis=1)
y=data_dt['strength']

# Splitting X & y into training and testing sets in the ratio 70:30
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=1)
In [77]:
dt2=DecisionTreeRegressor()
dt2.fit(X_train,y_train)
Out[77]:
DecisionTreeRegressor()
In [78]:
y_pred=dt2.predict(X_test)
dt_trainacc=dt2.score(X_train,y_train)
dt_testacc=dt2.score(X_test,y_test)
print('Training Accuracy (DT) : ' ,dt_trainacc)
print('Testing Accuracy (DT): ',dt_testacc )
Training Accuracy (DT) :  0.994250323773731
Testing Accuracy (DT):  0.8354540744656248
  • The model is overfitting: the accuracy on the train data is about 99%, while on the test data it is only about 84%
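A single train/test split can overstate or understate this gap; k-fold cross-validation gives a steadier read on generalization. A sketch on synthetic stand-in data (the notebook's `X` and `y` would be used in practice):

```python
# 5-fold cross-validated R2 for a decision tree (sketch, synthetic stand-in data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(2)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, 0.5, 0.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=500)

scores = cross_val_score(DecisionTreeRegressor(random_state=2), X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())  # mean R2 across folds, plus its spread
```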
In [79]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_pred, stat_func=pearsonr,kind="reg", color="k");
In [326]:
tempresultsdf=pd.DataFrame({'Method':['Decision Tree'], 'Train Accuracy': dt_trainacc,'Test Accuracy':dt_testacc},index={'4'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[326]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454

Hyperparameter Tuning with GridSearchCV

In [111]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': np.arange(3, 6),
             'criterion' : ['mse','mae'],
             'max_leaf_nodes': [100, 105, 90, 95],
             'min_samples_split': [6, 7, 8, 9, 10],
             'max_features':[2, 3, 4, 5, 6]}
grid_tree_dt = GridSearchCV(DecisionTreeRegressor(), param_grid, cv = 10, scoring= 'r2')
grid_tree_dt.fit(X_train, y_train)
print(grid_tree_dt.best_estimator_)
print('Best Score:', np.abs(grid_tree_dt.best_score_))
DecisionTreeRegressor(max_depth=5, max_features=4, max_leaf_nodes=105,
                      min_samples_split=6)
Best Score: 0.7347805390536392
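This grid evaluates roughly 600 parameter combinations over 10 folds. When the grid grows, RandomizedSearchCV samples a fixed number of combinations and is often nearly as good at a fraction of the cost. A sketch on synthetic stand-in data (parameter ranges here are illustrative, not the notebook's):

```python
# Randomized search as a cheaper alternative to an exhaustive grid (sketch).
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(3)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + rng.normal(scale=0.5, size=400)

param_dist = {'max_depth': randint(3, 8),
              'min_samples_split': randint(2, 20),
              'max_leaf_nodes': randint(20, 120)}
search = RandomizedSearchCV(DecisionTreeRegressor(random_state=3), param_dist,
                            n_iter=25, cv=5, scoring='r2', random_state=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Here only 25 of the possible combinations are fitted, sampled from the given distributions.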
In [114]:
# fitting the decision tree regressor with the tuned hyperparameters (criterion='mae')

dt3 = DecisionTreeRegressor(criterion = 'mae',max_depth=5,min_samples_split=6,max_leaf_nodes=105,max_features=4)
dt3.fit(X_train, y_train)
Out[114]:
DecisionTreeRegressor(criterion='mae', max_depth=5, max_features=4,
                      max_leaf_nodes=105, min_samples_split=6)
In [119]:
y_preddt=dt3.predict(X_test)
dt3_trainacc=dt3.score(X_train,y_train)
dt3_testacc=dt3.score(X_test,y_test)
print('Training Accuracy (DT Hyperparameter tuning) : ' ,dt3_trainacc)
print('Testing Accuracy (DT Hyperparameter tuning): ',dt3_testacc )
Training Accuracy (DT Hyperparameter tuning) :  0.7756820725563555
Testing Accuracy (DT Hyperparameter tuning):  0.7203799432248212
In [120]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_preddt, stat_func=pearsonr,kind="reg", color="k");
In [327]:
tempresultsdf=pd.DataFrame({'Method':['Decision Tree with Hyperparameter Tuning'], 'Train Accuracy': dt3_trainacc,'Test Accuracy':dt3_testacc},index={'5'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[327]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
  • Hyperparameter tuning reduced the overfitting, but it brought down the model performance as well

Random Forest Regressor

In [132]:
#creating copy of the dataset
data_rf=data_scaled.copy()
In [142]:
X=data_rf.drop(['strength'],axis=1)
y=data_rf['strength']

# Splitting X & y into training and testing sets in the ratio 70:30
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=1)
rf=RandomForestRegressor()
rf.fit(X_train, y_train)
Out[142]:
RandomForestRegressor()
Feature Importance plot using Random Forest Regressor
In [313]:
pd.DataFrame(rf.feature_importances_, index = data_rf.columns[:-1], 
             columns=['Importance']).sort_values('Importance',ascending=False).plot(kind='bar',color='r', figsize=(15,7), title='Feature Importance of Random Forest')
Out[313]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f91b3d80460>
  • cement and age are important attributes
  • coarseagg, fineagg, superplastic and ash are less important, so they will not contribute much to the strength
  • coarseagg, fineagg and ash were flagged as weak contributors by both the correlation map and the feature importance plots, hence we will drop these columns

Dropping of ash, fineagg and coarseagg attributes

In [244]:
X=data_rf.drop(['strength','ash','fineagg','coarseagg'],axis=1)
y=data_rf['strength']

# Splitting X & y into training and testing sets in the ratio 70:30
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=1)
In [245]:
rf2=RandomForestRegressor()
rf2.fit(X_train,y_train)
Out[245]:
RandomForestRegressor()
In [246]:
y_pred=rf2.predict(X_test)
rf_trainacc=rf2.score(X_train,y_train)
rf_testacc=rf2.score(X_test,y_test)
print('Training Accuracy (RF) : ' ,rf_trainacc)
print('Testing Accuracy (RF): ',rf_testacc )
Training Accuracy (RF) :  0.981858425406705
Testing Accuracy (RF):  0.906828693418344
  • The model is overfitting: the accuracy on the train data is approximately 98%, while on the test data it is only about 91%
In [149]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_pred, stat_func=pearsonr,kind="reg", color="k");
In [328]:
tempresultsdf=pd.DataFrame({'Method':['Random Forest'], 'Train Accuracy': rf_trainacc,'Test Accuracy':rf_testacc},index={'6'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[328]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829

Hyperparameter Tuning with GridSearchCV

In [173]:
param_grid = {'max_depth': np.arange(3, 8),
             'criterion' : ['mse','mae'],
             'max_leaf_nodes': [100, 105, 90, 95],
             'min_samples_split': [6, 7, 8, 9, 10],
             'max_features':['auto','sqrt','log2']}

grid_tree_rf = GridSearchCV(RandomForestRegressor(), param_grid, cv = 5, scoring= 'r2')
grid_tree_rf.fit(X_train, y_train)
print(grid_tree_rf.best_estimator_)
print('Best Score:', np.abs(grid_tree_rf.best_score_))
RandomForestRegressor(max_depth=7, max_leaf_nodes=90, min_samples_split=6)
Best Score: 0.8634183175435434
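Random forests also provide a free generalization estimate via out-of-bag (OOB) samples, which avoids a separate validation split when comparing hyperparameter candidates. A sketch on synthetic stand-in data (the notebook's `X_train`/`y_train` would be used in practice):

```python
# Out-of-bag R2 as a built-in validation estimate for a random forest (sketch).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(6)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.5, -1.0, 0.5, 2.0, 0.0]) + rng.normal(scale=0.4, size=500)

rf_oob = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=6)
rf_oob.fit(X, y)
print(round(rf_oob.oob_score_, 3))  # R2 estimated from out-of-bag samples
```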
In [186]:
# fitting the random forest regressor with the tuned hyperparameters (criterion='mse')

rf3 = RandomForestRegressor(criterion = 'mse',max_leaf_nodes=90,max_depth=7,min_samples_split=6,)
rf3.fit(X_train, y_train)
Out[186]:
RandomForestRegressor(max_depth=7, max_leaf_nodes=90, min_samples_split=6)
In [247]:
y_predrf=rf3.predict(X_test)
rf3_trainacc=rf3.score(X_train,y_train)
rf3_testacc=rf3.score(X_test,y_test)
print('Training Accuracy (RF Hyperparameter tuning) : ' ,rf3_trainacc)
print('Testing Accuracy (RF Hyperparameter tuning): ',rf3_testacc )
Training Accuracy (RF Hyperparameter tuning) :  0.9412439336100229
Testing Accuracy (RF Hyperparameter tuning):  0.880184289576261
In [192]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_predrf, stat_func=pearsonr,kind="reg", color="k");
In [329]:
tempresultsdf=pd.DataFrame({'Method':['Random Forest with Hyperparameter Tuning'], 'Train Accuracy': rf3_trainacc,'Test Accuracy':rf3_testacc},index={'7'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[329]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829
7 Random Forest with Hyperparameter Tuning 0.941244 0.880184

Gradient Boosting Regressor

In [201]:
#creating copy of the dataset
data_gb=data_scaled.copy()
In [204]:
X=data_gb.drop(['strength'],axis=1)
y=data_gb['strength']

# Splitting X & y into training and testing sets in the ratio 70:30
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=1)
gb=GradientBoostingRegressor()
gb.fit(X_train, y_train)
Out[204]:
GradientBoostingRegressor()
Feature Importance plot using Gradient Boost Regressor
In [314]:
pd.DataFrame(gb.feature_importances_, index = data_gb.columns[:-1], 
             columns=['Importance']).sort_values('Importance',ascending=False).plot(kind='bar',color='g', figsize=(15,7), title='Feature Importance of Gradient Boost Regressor')
Out[314]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f91b3fd3880>
  • cement and age are important attributes
  • coarseagg, fineagg, superplastic and ash are less important, so they will not contribute much to the strength
  • coarseagg, fineagg and ash were flagged as weak contributors by both the correlation map and the feature importance plots, hence we will drop these columns

Dropping of ash, fineagg and coarseagg attributes

In [248]:
X=data_gb.drop(['strength','ash','fineagg','coarseagg'],axis=1)
y=data_gb['strength']

# Splitting X & y into training and testing sets in the ratio 70:30
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=1)
In [249]:
gb2=GradientBoostingRegressor()
gb2.fit(X_train,y_train)
Out[249]:
GradientBoostingRegressor()
In [250]:
y_pred=gb2.predict(X_test)
gb_trainacc=gb2.score(X_train,y_train)
gb_testacc=gb2.score(X_test,y_test)
print('Training Accuracy (GB) : ' ,gb_trainacc)
print('Testing Accuracy (GB): ',gb_testacc )
Training Accuracy (GB) :  0.9436596412246991
Testing Accuracy (GB):  0.8920580705652337
  • The model shows mild overfitting: the accuracy on the train data is approximately 94%, while on the test data it is about 89%
In [212]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_pred, stat_func=pearsonr,kind="reg", color="k");
In [330]:
tempresultsdf=pd.DataFrame({'Method':['Gradient Boost'], 'Train Accuracy': gb_trainacc,'Test Accuracy':gb_testacc},index={'8'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[330]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829
7 Random Forest with Hyperparameter Tuning 0.941244 0.880184
8 Gradient Boost 0.943660 0.892058

Hyperparameter Tuning with GridSearchCV

In [218]:
param_grid = {'n_estimators': [100, 200, 250, 500],
              'max_depth': range(10, 31, 2), 
              'min_samples_split': range(50, 501, 10), 
              'learning_rate':[0.1, 0.2]}
clf = GridSearchCV(GradientBoostingRegressor(random_state = 1)
                   , param_grid, cv = 5, scoring= 'r2').fit(X_train, y_train)
print(clf.best_estimator_) 
print('Best Score:', clf.best_score_)
GradientBoostingRegressor(max_depth=18, min_samples_split=140, n_estimators=500,
                          random_state=1)
Best Score: 0.9215963394593659
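This grid covers several thousand parameter combinations, so each search is expensive. GradientBoostingRegressor also supports built-in early stopping via `validation_fraction` and `n_iter_no_change`, which caps the number of boosting stages automatically. A sketch on synthetic stand-in data:

```python
# Early stopping in gradient boosting (sketch, synthetic stand-in data).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(4)
X = rng.normal(size=(600, 5))
y = X @ np.array([2.0, -1.0, 0.5, 1.0, 0.0]) + rng.normal(scale=0.3, size=600)

gb = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                               validation_fraction=0.2, n_iter_no_change=10,
                               random_state=4)
gb.fit(X, y)
print(gb.n_estimators_)  # number of stages actually fitted, often well under 500
```

Training stops once the score on the internal validation fraction fails to improve for 10 consecutive stages.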
In [219]:
# fitting the gradient boosting regressor with the tuned hyperparameters

gb3 = GradientBoostingRegressor(max_depth=18, min_samples_split=140, n_estimators=500,
                          random_state=1)
gb3.fit(X_train, y_train)
Out[219]:
GradientBoostingRegressor(max_depth=18, min_samples_split=140, n_estimators=500,
                          random_state=1)
In [251]:
y_predgb=gb3.predict(X_test)
gb3_trainacc=gb3.score(X_train,y_train)
gb3_testacc=gb3.score(X_test,y_test)
print('Training Accuracy (GB Hyperparameter tuning) : ' ,gb3_trainacc)
print('Testing Accuracy (GB Hyperparameter tuning): ',gb3_testacc )
Training Accuracy (GB Hyperparameter tuning) :  0.9906772781802812
Testing Accuracy (GB Hyperparameter tuning):  0.9316677090318916
In [221]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_predgb, stat_func=pearsonr,kind="reg", color="k");
In [331]:
tempresultsdf=pd.DataFrame({'Method':['Gradient Boost with Hyperparameter Tuning'], 'Train Accuracy': gb3_trainacc,'Test Accuracy':gb3_testacc},index={'9'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[331]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829
7 Random Forest with Hyperparameter Tuning 0.941244 0.880184
8 Gradient Boost 0.943660 0.892058
9 Gradient Boost with Hyperparameter Tuning 0.990677 0.931668

Ada Boosting Regressor

In [223]:
#creating copy of the dataset
data_ab=data_scaled.copy()
In [224]:
X=data_ab.drop(['strength'],axis=1)
y=data_ab['strength']

# Splitting X & y into training and testing sets in the ratio 70:30
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=1)
ab=AdaBoostRegressor()
ab.fit(X_train, y_train)
Out[224]:
AdaBoostRegressor()
Feature Importance plot using Ada Boost Regressor
In [316]:
pd.DataFrame(ab.feature_importances_, index = data_ab.columns[:-1], 
             columns=['Importance']).sort_values('Importance',ascending=False).plot(kind='bar',color='y', figsize=(15,7), title='Feature Importance of Ada Boost Regressor')
Out[316]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f91b3f8e3d0>
  • cement, age and water are important attributes
  • coarseagg, fineagg, superplastic and ash are less important, so they will not contribute much to the strength
  • coarseagg, fineagg and ash were flagged as weak contributors by both the correlation map and the feature importance plots, hence we will drop these columns

Dropping of ash, fineagg and coarseagg attributes

In [285]:
X=data_ab.drop(['strength','ash','fineagg','coarseagg'],axis=1)
y=data_ab['strength']

# Splitting X & y into training and testing sets in the ratio 70:30
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30,random_state=1)
In [253]:
ab2=AdaBoostRegressor()
ab2.fit(X_train,y_train)
Out[253]:
AdaBoostRegressor()
In [254]:
y_pred=ab2.predict(X_test)
ab_trainacc=ab2.score(X_train,y_train)
ab_testacc=ab2.score(X_test,y_test)
print('Training Accuracy (AB) : ' ,ab_trainacc)
print('Testing Accuracy (AB): ',ab_testacc )
Training Accuracy (AB) :  0.8087488353497809
Testing Accuracy (AB):  0.7554641302752207
In [230]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_pred, stat_func=pearsonr,kind="reg", color="k");
In [332]:
tempresultsdf=pd.DataFrame({'Method':['Ada Boost'], 'Train Accuracy': ab_trainacc,'Test Accuracy':ab_testacc},index={'10'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[332]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829
7 Random Forest with Hyperparameter Tuning 0.941244 0.880184
8 Gradient Boost 0.943660 0.892058
9 Gradient Boost with Hyperparameter Tuning 0.990677 0.931668
10 Ada Boost 0.808749 0.755464

Hyperparameter Tuning with GridSearchCV

In [262]:
param_grid = {'n_estimators': [100, 200, 250, 500],
              'loss' : ['linear', 'square', 'exponential'], 
              'learning_rate':[0.1, 0.2]}
clf = GridSearchCV(AdaBoostRegressor(random_state = 1)
                   , param_grid, cv = 5, scoring= 'r2').fit(X_train, y_train)
print(clf.best_estimator_) 
print('Best Score:', clf.best_score_)
AdaBoostRegressor(learning_rate=0.2, loss='square', n_estimators=500,
                  random_state=1)
Best Score: 0.7861149349900618
In [263]:
# fitting the AdaBoost regressor with the tuned hyperparameters

ab3 = AdaBoostRegressor(learning_rate=0.2, loss='square', n_estimators=500,
                  random_state=1)
ab3.fit(X_train, y_train)
Out[263]:
AdaBoostRegressor(learning_rate=0.2, loss='square', n_estimators=500,
                  random_state=1)
In [264]:
y_predab=ab3.predict(X_test)
ab3_trainacc=ab3.score(X_train,y_train)
ab3_testacc=ab3.score(X_test,y_test)
print('Training Accuracy (AB Hyperparameter tuning) : ' ,ab3_trainacc)
print('Testing Accuracy (AB Hyperparameter tuning): ',ab3_testacc )
Training Accuracy (AB Hyperparameter tuning) :  0.8187618611755467
Testing Accuracy (AB Hyperparameter tuning):  0.7758267186095817
In [266]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_predab, stat_func=pearsonr,kind="reg", color="k");
In [333]:
tempresultsdf=pd.DataFrame({'Method':['Ada Boost with Hyperparameter Tuning'], 'Train Accuracy': ab3_trainacc,'Test Accuracy':ab3_testacc},index={'11'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[333]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829
7 Random Forest with Hyperparameter Tuning 0.941244 0.880184
8 Gradient Boost 0.943660 0.892058
9 Gradient Boost with Hyperparameter Tuning 0.990677 0.931668
10 Ada Boost 0.808749 0.755464
11 Ada Boost with Hyperparameter Tuning 0.818762 0.775827

Bagging Regressor

In [256]:
br=BaggingRegressor()
br.fit(X_train,y_train)
Out[256]:
BaggingRegressor()
In [258]:
y_pred=br.predict(X_test)
br_trainacc=br.score(X_train,y_train)
br_testacc=br.score(X_test,y_test)
print('Training Accuracy (BR) : ' ,br_trainacc)
print('Testing Accuracy (BR): ',br_testacc )
Training Accuracy (BR) :  0.9765479494533592
Testing Accuracy (BR):  0.8969069566393094
  • The model is overfitting: the accuracy on the train data is approximately 98%, while on the test data it is only about 90%
In [259]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_pred, stat_func=pearsonr,kind="reg", color="k");
In [334]:
tempresultsdf=pd.DataFrame({'Method':['Bagging Regressor'], 'Train Accuracy': br_trainacc,'Test Accuracy':br_testacc},index={'12'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[334]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829
7 Random Forest with Hyperparameter Tuning 0.941244 0.880184
8 Gradient Boost 0.943660 0.892058
9 Gradient Boost with Hyperparameter Tuning 0.990677 0.931668
10 Ada Boost 0.808749 0.755464
11 Ada Boost with Hyperparameter Tuning 0.818762 0.775827
12 Bagging Regressor 0.976548 0.896907

Hyperparameter Tuning with GridSearchCV

In [268]:
param_grid = {'n_estimators': [100, 200, 250, 500],
              'max_features':[2,4,6,8]}
clf = GridSearchCV(BaggingRegressor(random_state = 1)
                   , param_grid, cv = 5, scoring= 'r2').fit(X_train, y_train)
print(clf.best_estimator_) 
print('Best Score:', clf.best_score_)
BaggingRegressor(max_features=4, n_estimators=100, random_state=1)
Best Score: 0.8565345611705929
In [272]:
# fitting the bagging regressor with the tuned hyperparameters

br2 = BaggingRegressor(max_features=4, n_estimators=100, random_state=1)
br2.fit(X_train, y_train)
Out[272]:
BaggingRegressor(max_features=4, n_estimators=100, random_state=1)
In [273]:
y_predbr=br2.predict(X_test)
br2_trainacc=br2.score(X_train,y_train)
br2_testacc=br2.score(X_test,y_test)
print('Training Accuracy (BR Hyperparameter tuning) : ' ,br2_trainacc)
print('Testing Accuracy (BR Hyperparameter tuning): ',br2_testacc )
Training Accuracy (BR Hyperparameter tuning) :  0.9669680595411435
Testing Accuracy (BR Hyperparameter tuning):  0.8777978408251664
In [274]:
sns.set(style="darkgrid", color_codes=True) 
with sns.axes_style("white"):
    sns.jointplot(x=y_test, y=y_predbr, stat_func=pearsonr,kind="reg", color="k");
In [335]:
tempresultsdf=pd.DataFrame({'Method':['Bagging Regressor with Hyperparameter Tuning'], 'Train Accuracy': br2_trainacc,'Test Accuracy':br2_testacc},index={'13'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[335]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829
7 Random Forest with Hyperparameter Tuning 0.941244 0.880184
8 Gradient Boost 0.943660 0.892058
9 Gradient Boost with Hyperparameter Tuning 0.990677 0.931668
10 Ada Boost 0.808749 0.755464
11 Ada Boost with Hyperparameter Tuning 0.818762 0.775827
12 Bagging Regressor 0.976548 0.896907
13 Bagging Regressor with Hyperparameter Tuning 0.966968 0.877798

KNN Regressor

In [289]:
from sklearn.neighbors import KNeighborsRegressor

error=[]
for i in range(1,30):
    knn=KNeighborsRegressor(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(np.abs(pred_i - y_test)))  # mean absolute error for each k
In [290]:
plt.figure(figsize=(12,6))
plt.plot(range(1,30),error,color='red', linestyle='dashed',marker='o',markerfacecolor='blue',markersize=10)
plt.title('Mean Error vs K Value')
plt.xlabel('K Value')
plt.ylabel('Mean error')
Out[290]:
Text(0, 0.5, 'Mean error')
In [292]:
#k=3
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)
Out[292]:
KNeighborsRegressor(n_neighbors=3)
In [295]:
y_predknn=knn.predict(X_test)
knn_trainacc=knn.score(X_train,y_train)
knn_testacc=knn.score(X_test,y_test)
print('Training Accuracy (KNN) : ' ,knn_trainacc)
print('Testing Accuracy (KNN): ',knn_testacc )
Training Accuracy (KNN) :  0.914766098106188
Testing Accuracy (KNN):  0.8270232556232904
In [297]:
tempresultsdf=pd.DataFrame({'Method':['KNN'], 'Train Accuracy': knn_trainacc,'Test Accuracy':knn_testacc},index={'13'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[297]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829
7 Random Forest with Hyperparameter Tuning 0.941244 0.880184
8 Gradient Boost 0.943660 0.892058
9 Gradient Boost with Hyperparameter Tuning 0.990677 0.931668
10 Ada Boost 0.808749 0.755464
11 Ada Boost with Hyperparameter Tuning 0.818762 0.775827
12 Bagging Regressor 0.976548 0.896907
13 Bagging Regressor with Hyperparameter Tuning 0.966968 0.877798
13 KNN 0.914766 0.827023

Hyperparameter Tuning with GridSearchCV

In [307]:
param_grid = {'n_neighbors' :range(1, 21, 2),
                'weights' :['uniform','distance'],
                'metric' : ['euclidean', 'manhattan', 'minkowski']}
clf = GridSearchCV(KNeighborsRegressor()
                   , param_grid, cv = 5, scoring= 'r2').fit(X_train, y_train)
print(clf.best_estimator_) 
print('Best Score:', clf.best_score_)
KNeighborsRegressor(metric='euclidean', weights='distance')
Best Score: 0.8211638505151282
In [308]:
# fitting the KNN regressor with the tuned hyperparameters

knn2 = KNeighborsRegressor(metric='euclidean', weights='distance')
knn2.fit(X_train, y_train)
Out[308]:
KNeighborsRegressor(metric='euclidean', weights='distance')
In [310]:
y_predknn2=knn2.predict(X_test)
knn2_trainacc=knn2.score(X_train,y_train)
knn2_testacc=knn2.score(X_test,y_test)
print('Training Accuracy (KNN Hyperparameter tuning) : ' ,knn2_trainacc)
print('Testing Accuracy (KNN Hyperparameter tuning): ',knn2_testacc )
Training Accuracy (KNN Hyperparameter tuning) :  0.994250323773731
Testing Accuracy (KNN Hyperparameter tuning):  0.8457642795676894
In [336]:
tempresultsdf=pd.DataFrame({'Method':['KNN with Hyperparameter Tuning'], 'Train Accuracy': knn2_trainacc,'Test Accuracy':knn2_testacc},index={'14'})
results=pd.concat([results,tempresultsdf])
results = results[['Method', 'Train Accuracy','Test Accuracy']]
results
Out[336]:
Method Train Accuracy Test Accuracy
1 Regression 0.734879 0.738512
2 Ridge 0.734876 0.738682
3 Lasso 0.656046 0.650420
4 Decision Tree 0.994250 0.835454
5 Decision Tree with Hyperparameter Tuning 0.775682 0.720380
6 Random Forest 0.981858 0.906829
7 Random Forest with Hyperparameter Tuning 0.941244 0.880184
8 Gradient Boost 0.943660 0.892058
9 Gradient Boost with Hyperparameter Tuning 0.990677 0.931668
10 Ada Boost 0.808749 0.755464
11 Ada Boost with Hyperparameter Tuning 0.818762 0.775827
12 Bagging Regressor 0.976548 0.896907
13 Bagging Regressor with Hyperparameter Tuning 0.966968 0.877798
14 KNN with Hyperparameter Tuning 0.994250 0.845764
In [357]:
results=results.sort_values("Test Accuracy",ascending=False)
fig=plt.figure(figsize=(15,10))
ax=sns.barplot(y="Method", x="Test Accuracy", data=results)

total = len(results["Test Accuracy"])
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_width())
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y))

plt.show()

Observation

* From the above comparison, we found that the Random Forest and Gradient Boost models have high performance compared to the others, so these algorithms are suitable for this project

* The overfitting is reduced by hyperparameter tuning

Model performance range at 95% confidence level

Bootstrap Sampling using Gradient Boosting Regressor

In [359]:
data_scaled.head()
Out[359]:
cement slag ash water superplastic coarseagg fineagg age strength
0 -1.339017 1.603837 -0.847144 1.038806 -1.070393 -0.014398 -0.312289 -0.276792 -0.354999
1 -1.074790 -0.367612 1.096078 -1.099025 0.812800 1.388141 0.287169 -0.683574 -0.737503
2 -0.298384 -0.857572 0.648965 0.277322 -0.111360 -0.206121 1.104041 -0.276792 -0.395168
3 -0.145209 0.466016 -0.847144 2.197586 -1.070393 -0.526517 -1.298819 -0.276792 0.601862
4 -1.209776 1.271779 -0.847144 0.556375 0.516371 0.958372 -0.963273 -0.276792 -1.050462
In [360]:
values=data_scaled.values
In [374]:
from sklearn.metrics import r2_score
from sklearn.utils import resample

n_iterations = 1000                      # number of bootstrap samples to create
n_size = int(len(data_scaled) * 0.50)    # use only 50% of the given data in every bootstrap sample

# run bootstrap
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)  # Sampling with replacement 
    test = np.array([x for x in values if x.tolist() not in train.tolist()])  # picking rest of the data not considered in sample
    # fit model
    model = GradientBoostingRegressor(random_state=1, max_depth=12, 
                                min_samples_split=100, n_estimators=200, 
                                learning_rate=0.2)
    
    model.fit(train[:,:-1], train[:,-1])
    
    # evaluate model
    predictions = model.predict(test[:,:-1])
    score = r2_score(test[:,-1], predictions)    # out-of-sample R2 for this bootstrap iteration
    print(score)
    stats.append(score)
0.9038266545815575
0.874539367638236
0.8966629528762807
0.8989612600737938
0.8838462951167405
0.8735414865302189
0.8877556800952466
0.9035190654153593
0.893131067336586
0.9076041091293015
... (remaining bootstrap R2 scores omitted for brevity; 1000 scores are printed in total)
0.8934286534933473
0.8937667109718026
0.8961802530985594
0.8832366531640702
0.905520482985268
0.894677294280813
0.9027830909415504
0.9101955941495625
0.8825522630052214
0.887977776197267
0.9169331540995581
0.9029333290836257
0.8871961303550466
0.8852688791466822
0.8775120583895681
0.8812595967859238
0.8901739654965297
0.8881047273453856
0.8932244314487126
0.8761613913539535
0.8920609633060913
0.8817251690416767
0.8947567245592092
0.9029185764482
0.8883981888297063
0.9045105684566154
0.9014180239111494
0.8896800450544387
0.8932414930866825
0.8979379822050136
0.8857619628669139
0.9087446539431683
0.8994089793225495
0.9133698483821475
0.8851211718085431
0.8848139506626966
0.8992790131784782
0.8949944980025011
0.899643308931562
0.9188291087320402
0.9142622275807558
0.8847198841133922
0.8833746711519087
0.8980632342062037
0.877726484061312
0.9132476138203436
0.909875916938389
0.8997621937160722
0.892058338797479
0.8984232593686377
0.8848773657042764
0.9055246152509716
0.8864376047108008
0.8797287734868061
0.9045714667853861
0.8891357151967803
0.9058697023078579
0.8815427553700046
0.9126427767375146
0.9038939342974123
0.8932640925168501
0.8903603599975474
0.8951512389133838
0.9005065082397272
0.9029067276319439
0.9103335375395323
0.8859660911206061
0.8950160576667359
0.8933143199332054
0.9061945767925191
0.8857580507943216
0.8952618228370751
0.9037154979114598
0.9068421256370107
0.90340343466316
0.8910422004004444
0.8774533676396806
0.9153005461339018
0.9020272845783727
0.9014011479100985
0.8929293674275269
0.8852447610440104
0.8988467157403641
0.8891811817374841
0.9012237189757037
0.8880503207194012
0.9017819079303343
0.8777245823423624
0.9096865088982747
0.9094422112363325
0.8876330348826604
0.9033896721833193
0.8865655803106598
0.876433499968803
0.8981794539268304
0.893099874315944
0.8786402267385728
0.8924696840958899
0.8906227839210187
0.908296591976173
0.9041974071125404
0.8764014586009048
0.8999661925762321
0.9133546359468838
0.9027885760789278
0.9032398287466767
0.8771612326087832
0.9020647647493182
0.8926004702995679
0.8921665652664649
0.9074328371879695
0.8858583680254045
0.8818614904246148
0.8961264353083475
0.8859912723020165
0.8899331101237786
0.8808476254915136
0.883323432843863
0.8913187155216332
0.8836705704434016
0.8607444598236196
0.8960228795809043
0.8920204734572253
0.8817862957283753
0.9147944189879162
0.8806615922953586
0.8915083835504778
0.895995783492971
0.8783092800536472
0.8839138530032138
0.8963566611400761
0.8927402233924365
0.8895259227333654
0.8770280107695777
0.886526396288277
0.8778289345769081
0.8885109864975599
0.8801753310375119
0.9008145442934026
0.8811450704984864
0.8912223300297698
0.9001917536434194
0.8994679201314796
0.9066306615838419
0.8913463453251917
0.8965509861264032
0.9050547993454161
0.8819692850513446
0.8979281214229727
0.903103821930033
0.8898959830216733
0.8842329341690748
0.8976547344547214
0.8934722460297685
0.8933861275675412
0.9004680326223947
0.9052137259720878
0.8753502596339591
0.9013066450695613
0.9054208045107134
0.8960864189344625
0.8825301897593577
0.8775460250417237
0.8885467716392846
0.9012331954643391
0.9029031544446782
0.868435945571615
0.9007789983888179
0.8921485188440357
0.895461909018171
0.9035622622042372
0.8799532246893217
0.9077581076061928
0.8697031911231592
0.8930133623548112
0.8954636690042342
0.8944577278895204
0.901714070518755
0.8973817830444952
0.9106258055436399
0.8984037359393328
0.892954678278046
0.8755602967394696
0.8915060419634818
0.8971189586852706
0.8900559861054332
0.8934987602718026
0.8926861179596454
0.885289557972045
0.8775620485090817
0.8941967407969388
0.8958364419258503
0.892206502085314
0.8951754990911251
0.8936130179084512
0.9159202700250211
0.9011738254929775
0.8862007499244063
0.8820650885565864
0.8742639126339737
0.8941141882651945
0.8886965501728048
0.8817701454947229
0.9033559000207982
0.9031539192139929
0.9168498291909023
0.891227273449378
0.9012002516113511
0.8971245763852083
0.8826240858325896
0.8950463089001883
0.8783160794313298
0.8826943817039615
0.8986637820715573
0.9019220344077215
0.8873760500653626
0.9006997978062216
0.8847721690489452
0.9127571023018286
0.9024305309001839
0.9032355931628646
0.8967759113345963
0.897366032272566
0.8964564096201355
0.9037492497439712
0.9029750136677511
0.8902129087188525
0.8911490020211916
0.897191769926333
0.9022876594574086
0.8924762985268008
0.8983576269459378
0.8967191088627298
0.8807276976947518
0.889340416720341
0.8760045243776994
0.9042111657791152
0.91446393745761
0.8996214238884135
0.9002971638155661
0.891293223211794
0.8840397398072761
0.8853876103942075
0.9125227291409326
0.8742500010149394
0.8828424415953218
0.8714260496706213
0.8687511312761794
0.8852568286469631
0.8943435323040396
0.9018205553944342
0.8909616279238254
0.9051116034779896
0.896424272295377
0.8831101462634339
0.906418430400886
0.8863690012351817
0.8979650790191601
0.8926653610719355
0.895870724346336
0.8937678386505636
0.9035781661040189
0.8799023818791352
0.8984050075390426
0.893198984807115
0.8885439103821403
0.8905576526494046
0.9068094266838279
0.9002846797125641
0.9022227602074422
0.8946567256935142
0.9000948472095514
0.8991063349428952
0.8898699751333795
0.9084467058588889
0.8899728188533528
0.9195942869943291
0.9066878304426089
0.8766523518779725
0.8858505610270461
0.9005257671265974
0.8714073321060225
0.9106722175251215
0.9002069660054932
0.8847150460811459
0.9138122216167337
0.8837437766464188
0.9040545129730647
0.8836077064676073
0.8994134941078585
0.892386892187327
0.8793775028044594
0.9033554371847754
0.8911798020371106
0.90631917698563
0.9067211860917324
0.9032807917568071
0.8985979541861534
0.8941425622339441
0.8995957794389259
0.904229762278672
0.8905707175678405
0.8911655419114428
0.896471508785741
0.8999318119112172
0.8789514521809062
0.8985774503455288
0.8959948262438331
0.9110085832972903
0.8768533895142692
0.8928893029121305
0.8957454740871289
0.8691792755752126
0.8955976666868569
0.8893103568183575
0.88358340827106
0.8961634478583862
0.9031523969603503
0.9003186403355375
0.9023776429308075
0.8914456706072118
0.8996351854855132
0.8993439458010404
0.8667086918323454
0.8967880997666979
0.888508716149642
0.9068094236821242
0.8910059520253251
0.8845928810112983
0.896872396826366
0.9026721443152624
0.8923048775603936
0.9000811392747228
0.9150662439811742
0.8905009926777823
0.8880928341904257
0.8777055682145977
0.8968917615060844
0.8846686832655761
0.9007488313866661
0.8974489311814865
0.907892352258056
0.8942329877417795
0.9015234270432402
0.8957124939336085
0.8706878391361835
0.894825142600872
0.8851278629728797
0.899631047166592
0.9008995240220833
0.9154178587104567
0.8872637087383428
0.8924091696050833
0.8905561026074191
0.900264221290306
0.8845425168531481
0.8839641972030194
0.8978463142208601
0.8996914039698839
0.9007697736374103
0.877487121932429
0.8975819402296479
0.8955152207043761
0.8914466714619571
0.9124897712465764
0.8615603348179144
0.8986161278509865
0.876913210834817
0.8636619718595502
0.8875921076154301
0.8823649139994344
0.8954863666256726
0.8832095519870872
0.8968627345408744
0.9078846589578472
0.8721215526320913
0.9088440377728
0.892644582605867
0.8961043113973066
0.895033316916745
0.8918552963835074
0.8833368752864517
0.8826940617918275
0.8963464243829168
0.900441010594268
0.8991896956043804
0.9010465061770505
0.8800328171109509
0.9020702769159855
0.9037817469043326
0.8907323307446516
0.8891823998774717
0.9015428241734996
0.9069056962159001
0.885004998306128
0.884291995262441
0.9088742320003574
0.8997856234658584
0.8959212938458335
0.8932211713442291
0.8891090557991871
0.8944245308892927
0.9090619049840459
0.8970422868971488
0.9042463981808841
0.9079776089386917
0.876330752316958
0.8961299001770234
0.8980817108835722
0.9008108958153156
0.8914202747115576
0.8798897166379521
0.8846284329892558
0.872338145909798
0.899947991685296
0.8984715189543784
0.8775732279629129
0.9051732065177434
0.8825439635216145
0.8785717142350096
0.9055590841350751
0.8947165576786839
0.9018351501160263
0.878912210397116
0.904354778431886
0.8847410829536559
0.8872918191107799
0.8907604621436762
0.8932550476887432
0.9025716299324446
0.9165111606551387
0.9131121533871173
0.8840926317769173
0.8948208052486888
0.8898997417923342
0.8835207537623146
0.8969892444845086
0.9033006083339207
0.9031504552913053
0.8946047272190288
0.8774197921307058
0.8895725121591826
0.863660621711128
0.9032423218133975
0.8856685520519904
0.9056479913925568
0.8551856911123217
0.8869717971254674
0.8944038358021924
0.8728226591831486
0.892818234090145
0.9075986962564749
0.8916569513669006
0.8857142951620728
0.8895792398141966
0.9062943588047938
0.8743289034377933
0.9041365409913512
0.850435721702399
0.8997295238151188
0.8963628039742423
0.9095731555457448
0.896966529684631
0.8972428527520584
0.8983881111743531
0.8829675065196908
0.9020331487315014
0.8940288364363178
0.8695220963532868
0.9028855746636969
0.8742157300731667
0.8751210415288779
0.8913562454533884
0.8757278842561518
0.890779659237217
0.8883789490042588
0.8940808365536511
0.8880813214118027
0.8866533787197253
0.8900902404795505
0.9001927373548605
0.9010118166892868
0.8788976368860675
0.9104558035049968
0.8878299725361944
0.9057957530857107
0.9082273586022656
0.9036781044194433
0.8903581475781998
0.9074777581402259
0.8896689830861267
0.8882123599127798
0.891440906857724
0.8842920206155189
0.8979998062784007
0.8884663609563535
0.8986099057401074
0.8864300043261834
0.8777527637711735
0.8840454512402104
0.9144593954245548
0.8991762877403155
0.8921280411466539
0.9008709250815005
0.8854929358361204
0.9012532616939368
0.8830593043067655
0.8862298994064975
0.8891799287668404
0.9240413387867342
0.8914648146117139
0.8912079422790726
0.8844537899704776
0.901364403506553
0.9001462142411996
0.8926871212999268
0.8933552891221248
0.8924334155228645
0.8924073964848561
0.8885429992936846
0.8906564765902234
0.9086016124579381
0.9110062627812017
0.8980978695242613
0.9015141372486997
0.9018431643397392
0.8873864373352187
0.8942838548897222
0.8787094871319695
0.8683640593155355
0.9088232347953052
0.8804921470126205
0.9017948525756863
0.892588848470827
0.8853056373978873
0.8953680000646007
0.8788560736302724
0.8816857247958871
0.9089484248486033
0.8994335788097412
0.9050154195296702
0.8792045488726805
0.8954443580110835
0.8870557255186062
0.8814067998922652
0.895845413892945
0.9173209114441968
0.891268080143905
0.8901717224452822
0.8846628360834237
0.8740615766285423
0.9071323748398612
0.8933628281025066
0.8985050719427659
0.8761212263679445
0.9046087951901977
0.8836401030221744
0.8917915275753359
0.8927329763289416
0.910572952120055
0.8879792774697022
0.9018079170346521
0.913288113199358
0.899726381690112
0.8908354282844193
0.9168775505759569
0.8809693872192139
0.8804777048462122
0.8958130492759222
0.8885426340587051
0.9040770138650669
0.8998820425327327
0.8674743874820343
0.8894316934913045
0.8757501987943457
0.9095157148688384
0.8805489638892188
0.8755470024302909
0.8721119893951513
0.8939612971027784
0.9109716535234436
0.8776125386424123
0.8834034065248689
0.8923806511662028
0.8821703200334706
0.8844231529049515
0.9027677021856712
0.8854571284076636
0.9085963998323403
0.9002390543019816
0.9118174015714244
0.8957876634148596
0.8825383285086733
0.8960739837351325
0.9141670977032235
0.8972015321021991
0.8744146820851073
0.8778880584757194
0.9008632438181918
0.8849546695963568
0.9211435680334316
0.8959984205142697
0.8863321573687339
0.9059820098700962
0.8989155294375412
0.887555672189694
0.8867347744001173
0.9060683882012801
0.8827396004591139
0.9132468705100617
0.8636212155915525
0.8837354506911
0.9007376293263899
0.9118953149406928
0.8882700041279141
0.8824878100735561
0.8890766716934564
0.8926943810654111
0.8855499667039002
0.9141732449478434
0.8970377129482596
0.8962832292718138
0.89709757118663
0.9155001368602893
0.8800884152129141
0.8752167454537312
0.890347071871461
0.908764251620326
0.8930267707378801
0.8883470678680627
0.900008177808759
0.8996595923568107
0.8677970141885047
0.8845597243059737
0.914789507578282
0.8887119312653269
0.8969551337361914
0.8939476112457169
0.8772818662476511
0.900689724564177
0.8747947483685389
0.9190048785889128
0.8903479858989224
0.8654856653820037
0.902842322933377
0.9027794652225745
0.8786113326934414
0.8882835824420904
0.8775230620805684
0.8740320667066797
0.8800000505550234
0.9062085457145479
0.8895877901211462
0.9012241712571866
0.8925346057431212
0.8986144696019726
0.8857208906218988
0.906748046152527
0.9022027911177934
0.8825852399108973
0.8984491459767892
0.8984815092761161
0.8922115333333791
0.8975709082912767
0.8976042830038145
0.8905460701422376
0.8795413188355896
0.8772025707057118
0.8964195011179382
0.9005812729207543
0.8966891037970424
0.8764694347191306
0.8889514491341411
0.8921685883632701
0.8973424448777324
0.8940643828844186
0.895427150246083
0.9058041702451065
0.9075918899319818
0.8841359111937183
0.8777609715025876
0.9060221685488803
0.9016539973158034
0.8721763162701202
0.8945123708985037
0.8669852581016982
0.9070048681273191
0.8840455830781987
0.90488735719431
0.8978766593952675
0.8960794649872146
0.8949769906691156
0.8661819163967677
0.907563195776495
0.8802572781322349
0.9006118965972407
0.8906320842934593
0.8912901471026219
0.9014698299238957
0.9082528305868928
0.8982920859330117
0.896074521791878
0.8904365032345346
0.8815964939348168
0.8908413688708273
0.8934647841468263
0.8730741936663127
0.892385556872598
0.8811248631163348
0.8946648325310359
0.9107276217780003
0.879750834430078
0.8925457461750768
0.8990643449170787
0.8954335807120113
0.9150967462117018
0.8943990919177675
0.8965301558004743
0.8923608846439823
0.8841819706164996
0.8766569536086514
0.8881451287659872
0.8840020011758952
0.8916421210300967
0.8687277776614821
0.876963519195596
0.8900623936293091
0.8788417192063314
0.904924187508146
0.8592203318660738
0.8942378916973512
0.9015177342461025
0.9026861902068051
0.8954639841452253
0.8926975574713042
0.8977018797195174
0.8978054314131632
0.9082620479509369
0.9046714260429256
0.8863658585879726
0.9201939962532888
0.9095746087341684
0.8899389632808896
0.890832499496851
0.9235199406557746
0.8838511511682867
0.8946589401498015
0.8910201031012391
0.8879928173873769
0.8912316545298874
0.8973397714390192
0.8800774447927805
0.8960299694941787
0.9076183640786704
0.9091233155906094
0.8944783414630597
0.8965416762000115
0.8917410444247997
0.8739448155653744
0.8735501291745403
0.8730210635953782
0.9100424730709649
0.8931573335938554
In [384]:
# plot the distribution of bootstrap scores
plt.hist(stats)
plt.show()
# confidence intervals (percentile method)
alpha = 0.95                             # for a 95% confidence interval
p = ((1.0-alpha)/2.0) * 100              # lower tail cutoff: the 2.5th percentile
lower = max(0.0, np.percentile(stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100      # upper tail cutoff: the 97.5th percentile
upper = min(1.0, np.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
95.0 confidence interval 86.8% and 91.4%
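The percentile-interval computation in the cell above can be isolated into a small helper. This is a minimal sketch on toy scores (the values below are illustrative, not the notebook's output); `percentile_ci` is a hypothetical helper name.

```python
import numpy as np

def percentile_ci(scores, alpha=0.95):
    """Percentile bootstrap confidence interval for a list of scores in [0, 1]."""
    lower_p = ((1.0 - alpha) / 2.0) * 100          # e.g. 2.5 for alpha=0.95
    upper_p = (alpha + (1.0 - alpha) / 2.0) * 100  # e.g. 97.5
    lower = max(0.0, np.percentile(scores, lower_p))
    upper = min(1.0, np.percentile(scores, upper_p))
    return lower, upper

# illustrative toy scores
scores = [0.87, 0.88, 0.89, 0.90, 0.91, 0.92, 0.86, 0.885, 0.895, 0.905]
lo, hi = percentile_ci(scores)
```

Because the interval is read straight off the empirical distribution of bootstrap scores, it needs no normality assumption, which is the main appeal of this method.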

Bootstrap Sampling using Random Forest Regressor

In [388]:
from sklearn.utils import resample       # not imported above
from sklearn.metrics import r2_score     # not imported above

n_iterations = 1000                      # number of bootstrap samples to create
n_size = int(len(data_scaled) * 0.50)    # each bootstrap sample uses 50% of the data

# run bootstrap
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)  # sampling with replacement
    test = np.array([x for x in values if x.tolist() not in train.tolist()])  # rows not drawn into the sample (out-of-bag)
    # fit model
    model = RandomForestRegressor(n_estimators=100)
    model.fit(train[:, :-1], train[:, -1])
    # evaluate model on the held-out (out-of-bag) rows
    predictions = model.predict(test[:, :-1])
    score = r2_score(test[:, -1], predictions)
    print(score)
    stats.append(score)
0.8627098300696732
0.8491667863407336
0.8607272042299474
… (output truncated: remaining bootstrap R² scores, one per iteration, ranging from ≈0.802 to ≈0.894)
0.8255604724885564
0.8584186359470772
0.8305973252390426
0.8712032049022417
0.84699793826737
0.8617579934325573
0.8477090349147827
0.8684001664848697
0.8790876166283154
0.8739249734600038
0.8203728635451915
0.8786137799723683
0.8568669188599303
0.8887706369268398
0.8972813883638932
0.8702119427286243
0.8502852045589537
0.8544145017736777
0.8549636670640213
0.8594320816200153
0.8384724284973
0.8431846076152297
0.8664108766112513
0.875306777303228
0.8532376502626404
0.8590633101262086
0.8570900945385928
0.8454519126608584
0.8588404454113518
0.843235333752234
0.8624919095746888
0.8645680398457118
0.8597575374326607
0.8670147104478074
0.85531523128848
0.8608783055734446
0.8757418220084997
0.8654174482976931
0.8738439247844677
0.8550110984435239
0.8559741738651745
0.8579811397217705
0.8598779520736762
0.8803937264691947
0.871828114324078
0.8609352275458267
0.8453226888828874
0.8473769449708062
0.873973573369626
0.870310945132333
0.8512878899785358
0.8467184639321126
0.8530547141714139
0.8574256678010883
0.8734013108489138
0.8561378201346661
0.8503714544500747
0.8834730150250615
0.8368097293156878
0.8455335475977327
0.853636872680293
0.8503135602390595
0.8509952995203431
0.8228928390152835
0.8520127080512164
0.8409652855938801
0.8637929665013682
0.8785064775741895
0.8455578539541281
0.8439646478423914
0.8880603563871046
0.892760691435011
0.8550979180706776
0.8551406328603119
0.8563709542954463
0.845474856451643
0.8674753083932535
0.8459807622737601
0.8615969040166237
0.8746568351651469
0.8445351442029804
0.8455794234857474
0.8737873287515239
0.8539240601302741
0.8576557887704009
0.8538989832720953
0.8509999353295057
0.8629209292092098
0.8687814872430446
0.8688488421346293
0.8578229318727283
0.8332160133511035
0.8652087136671504
0.8438961061725182
0.8680580704402968
0.8538519977982293
0.8436288196068877
0.8678384600958128
0.8592448594612916
0.8614505994910688
0.8750100801127144
0.8451113021911254
0.8648551814315725
0.86230675609409
0.840200970439489
0.8390503944297297
0.852655006600683
0.8589151734558957
0.8702538633794702
0.8582691275087675
0.8639114012596362
0.8438560891757938
0.850184435374606
0.8565023335951054
0.852785981598496
0.8434447626865771
0.8268606955464259
0.8442224928210577
0.8717896218611101
0.8483023552919818
0.8459706697095688
0.8682450282639723
0.8822466420429347
0.8502010730823966
0.8402971056615534
0.885110917450392
0.8528392701054054
0.8641490954141966
0.847651382590392
0.8530376510765227
0.8387163858690402
0.8713221727873566
0.8856079289257197
0.8631901943833854
0.8692438085496847
0.8400028037648646
0.8457998073075683
0.846961991020037
0.84203441758934
0.8703972888665588
0.833974823753196
0.8548406013797117
0.8320218846457962
0.8568309593305582
In [389]:
# plot the distribution of bootstrap scores
plt.hist(stats)
plt.show()
# percentile confidence interval
alpha = 0.95                               # 95% confidence level
p = ((1.0-alpha)/2.0) * 100                # lower-tail percentile (2.5)
lower = max(0.0, np.percentile(stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100        # upper-tail percentile (97.5)
upper = min(1.0, np.percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
95.0 confidence interval 82.5% and 88.4%
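The percentile-bootstrap interval above can be wrapped in a small reusable function. This is a minimal sketch: the function name `bootstrap_ci` and the demo scores are illustrative, not part of the notebook (the real `stats` list would be passed in their place).

```python
import numpy as np

def bootstrap_ci(scores, alpha=0.95):
    """Percentile confidence interval from a list of bootstrap scores.

    Returns (lower, upper), clipped to [0, 1] since the metric is R^2-like.
    """
    tail = (1.0 - alpha) / 2.0 * 100           # e.g. 2.5 for alpha=0.95
    lower = max(0.0, np.percentile(scores, tail))
    upper = min(1.0, np.percentile(scores, 100 - tail))
    return lower, upper

# demo with synthetic scores (hypothetical, not the notebook's stats)
rng = np.random.default_rng(0)
demo = rng.normal(0.855, 0.015, size=500)
lo, hi = bootstrap_ci(demo)
print('95%% CI: %.3f to %.3f' % (lo, hi))
```

With the notebook's `stats` list this reproduces the interval printed above.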

Conclusion

From the above results we found that:

  • Gradient Boosting Regressor and Random Forest Regressor performed well on both the training and test data compared to the other models.
  • At the 95% confidence interval, the Gradient Boosting model gave the best performance, between 86.8% and 91.4%, which is better than Random Forest at 82.5% to 88.4%.
  • Feature importances were calculated using Decision Tree, Random Forest, Gradient Boosting, and AdaBoost regressors; most of them identified Age and Cement as the most important features.

  • Outliers were identified using boxplots and replaced with the median.

  • No missing values are present in the data.
  • Hyperparameter tuning reduced the overfitting of the models.
  • K-Means clustering was not successful, as the clusters were overlapping.
  • PCA was not useful here, since the dataset has only a few attributes.
  • Complex ensemble models gave higher accuracy than simple linear models.
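The feature-importance conclusion can be illustrated with a minimal sketch. The data here is synthetic and the feature names are hypothetical stand-ins (not the notebook's concrete dataset); the target is built to depend mostly on the first feature, mimicking the dominance of Age and Cement.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# synthetic stand-in for the dataset: 3 features, target driven by the first two
rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=200)

model = GradientBoostingRegressor(random_state=42).fit(X, y)
# feature_importances_ is normalized to sum to 1
for name, imp in zip(["age", "cement", "water"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

On the real data, the same `feature_importances_` attribute (available on all the tree ensembles used above) is what ranked Age and Cement first.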
In [ ]: